
[Figure: Top-1 accuracy (%) of 2-bit quantized DeiT-Small and DeiT-Base under four settings: Quantize MLP, Quantize Weights in MHSA, Quantize Activations in MHSA, and Fully Quantization, compared with the full-precision models.]

FIGURE 2.5

Analysis of bottlenecks from an architecture perspective. We report the accuracy of 2-bit quantized DeiT-S and DeiT-B on the ImageNet dataset when the corresponding full-precision structure is replaced by its quantized counterpart.

Quantizing the MLP module alone causes only a drop of 1.78% and 4.26% for DeiT-S and DeiT-B, respectively. Once the query, key, value, and attention weights are quantized, however, the performance drop (10.57%) is still significant, even when all weights of the linear layers in the MHSA module are kept in full precision. Thus, improving the attention structure is critical to mitigating the performance drop of quantized ViT.
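To make this ablation concrete, the following sketch (our illustration, not the authors' code) shows how one might fake-quantize only the MLP weights, only the MHSA weights, or both, and then re-evaluate accuracy as in Fig. 2.5. It assumes a timm-style DeiT model whose blocks expose mlp and attn sub-modules, and for brevity it quantizes weights only; the activation settings would wrap the corresponding activations in the same quantizer.

import torch
import torch.nn as nn


def quantize_tensor(x: torch.Tensor, bits: int = 2) -> torch.Tensor:
    # Symmetric uniform fake quantization to a signed `bits`-bit range.
    qmax = 2 ** (bits - 1) - 1                      # 1 for 2-bit signed values
    scale = x.abs().max() / qmax + 1e-8
    return torch.clamp(torch.round(x / scale), -qmax - 1, qmax) * scale


def quantize_linear_weights(module: nn.Module, bits: int = 2) -> None:
    # In-place fake-quantize the weights of every Linear layer inside `module`.
    for m in module.modules():
        if isinstance(m, nn.Linear):
            with torch.no_grad():
                m.weight.copy_(quantize_tensor(m.weight, bits))


def quantize_blocks(model: nn.Module, target: str = "mlp", bits: int = 2) -> None:
    # target in {"mlp", "attn", "all"}: which part of each transformer block to quantize.
    for block in model.blocks:                      # assumption: timm-style ViT/DeiT
        if target in ("mlp", "all"):
            quantize_linear_weights(block.mlp, bits)
        if target in ("attn", "all"):
            quantize_linear_weights(block.attn, bits)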

Optimization bottleneck. We compute the l2-norm distances between the attention weights of different blocks of the DeiT-S architecture, as shown in Fig. 2.6. The MHSA modules at different depths of the full-precision ViT learn different representations from images. As mentioned in [197], lower ViT layers attend to representations both locally and globally. However, the fully quantized ViT (blue lines in Fig. 2.6) fails to learn accurate attention-map distances across blocks. Therefore, a new design is required to make better use of the full-precision teacher information.
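As a concrete reading of this statistic, the sketch below (our own illustration, assuming the softmax attention maps of each block have already been collected for the same batch of images) computes the l2-norm distance between the attention maps of every pair of blocks. In the full-precision model these inter-block distances stay large and varied, whereas in the fully quantized baseline they tend to collapse, as Fig. 2.6 illustrates.

import torch


def attention_l2_distances(attn_maps) -> torch.Tensor:
    # attn_maps: one tensor per block, each of shape (B, heads, N, N), holding the
    # softmax attention probabilities of that block for the same input batch.
    # Returns a (num_blocks, num_blocks) matrix of l2 distances between blocks,
    # averaged over batch and heads.
    num_blocks = len(attn_maps)
    dist = torch.zeros(num_blocks, num_blocks)
    for i in range(num_blocks):
        for j in range(num_blocks):
            diff = attn_maps[i] - attn_maps[j]               # (B, heads, N, N)
            dist[i, j] = diff.flatten(start_dim=2).norm(dim=-1).mean()
    return dist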

2.3.3 Information Rectification in Q-Attention

To address the information distortion of quantized representations in forward propagation, we propose an efficient Q-Attention structure based on information theory, which statistically maximizes the entropy of the representation and revives the attention mechanism in the fully quantized ViT. Since representations with an extremely compressed bit width in fully quantized ViT have limited capability, the ideal quantized representation should preserve its full-precision counterpart as much as possible, which means that the mutual information between the quantized and full-precision representations should be maximized, as mentioned in [195].
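This entropy argument can be made tangible with a small numerical illustration (ours, not part of the method): for a deterministic quantizer, the mutual information between the full-precision values and their quantized versions equals the entropy of the quantized representation, so a quantizer whose range is poorly matched to the input distribution collapses most values into a few levels and discards most of the available information.

import torch


def quantized_entropy(x: torch.Tensor, bits: int = 2, scale: float = 1.0) -> float:
    # Entropy (in bits) of round(x / scale) clamped to a signed `bits`-bit range.
    qmax = 2 ** (bits - 1) - 1
    q = torch.clamp(torch.round(x / scale), -qmax - 1, qmax).long() + qmax + 1
    counts = torch.bincount(q.flatten(), minlength=2 ** bits).float()
    p = counts / counts.sum()
    p = p[p > 0]
    return float(-(p * p.log2()).sum())


x = torch.randn(100_000)                         # synthetic stand-in for a full-precision activation
print(quantized_entropy(x, bits=2, scale=1.0))   # reasonable scale: roughly 1.8 of the 2 available bits
print(quantized_entropy(x, bits=2, scale=10.0))  # scale far too large: entropy collapses toward 0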

We further present statistical results showing that, under distillation supervision, the query and key distributions in ViT architectures tend to follow Gaussian distributions, whose histograms are bell-shaped [195]. For example, Fig. 2.3 and Fig. 2.7 show the query and key distributions together with their corresponding probability density functions (PDFs), computed from the mean and standard deviation of each MHSA layer. Therefore, the query and key distributions in the MHSA modules of the full-precision counterparts are formulated as follows:

q ∼ N(μ(q), σ(q)),   k ∼ N(μ(k), σ(k)).   (2.18)
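As a hedged sketch of how this Gaussian fit can be checked in practice (our illustration; q stands for a query or key tensor captured from any full-precision MHSA layer), one can estimate the layer's mean and standard deviation and compare the normalized histogram of the values with the Gaussian density of Eq. (2.18):

import math
import torch


def gaussian_fit(q: torch.Tensor, num_bins: int = 64):
    # Estimate mu and sigma of the flattened query (or key) tensor and return the
    # normalized histogram together with the Gaussian density evaluated on the
    # same bin centers, so the two curves can be overlaid as in Fig. 2.3 / Fig. 2.7.
    mu, sigma = q.mean().item(), q.std().item()
    lo, hi = mu - 4 * sigma, mu + 4 * sigma
    hist = torch.histc(q.flatten().float(), bins=num_bins, min=lo, max=hi)
    bin_width = (hi - lo) / num_bins
    centers = torch.linspace(lo + bin_width / 2, hi - bin_width / 2, num_bins)
    empirical = hist / (hist.sum() * bin_width)              # histogram as a density
    gaussian = torch.exp(-0.5 * ((centers - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))
    return mu, sigma, empirical, gaussian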

Since weights and activations with a highly compressed bit width in fully quantized ViT have limited capability, the ideal quantization process should preserve the corresponding